{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Transformation with Amazon a SageMaker Processing Job and Scikit-Learn\n",
    "\n",
    "In this notebook, we convert raw text into BERT embeddings.  This will allow us to perform natural language processing tasks such as text classification.\n",
    "\n",
    "Typically a machine learning (ML) process consists of few steps. First, gathering data with various ETL jobs, then pre-processing the data, featurizing the dataset by incorporating standard techniques or prior knowledge, and finally training an ML model using an algorithm.\n",
    "\n",
    "Often, distributed data processing frameworks such as Scikit-Learn are used to pre-process data sets in order to prepare them for training. In this notebook we'll use Amazon SageMaker Processing, and leverage the power of Scikit-Learn in a managed SageMaker environment to run our processing workload."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NOTE:  THIS NOTEBOOK WILL TAKE A 5-10 MINUTES TO COMPLETE.\n",
    "\n",
    "# PLEASE BE PATIENT."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](img/prepare_dataset_bert.png)\n",
    "\n",
    "![](img/processing.jpg)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Contents\n",
    "\n",
    "1. Setup Environment\n",
    "1. Setup Input Data\n",
    "1. Setup Output Data\n",
    "1. Build a Scikit-Learn container for running the processing job\n",
    "1. Run the Processing Job using Amazon SageMaker\n",
    "1. Inspect the Processed Output Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Setup Environment\n",
    "\n",
    "Let's start by specifying:\n",
    "* The S3 bucket and prefixes that you use for training and model data. Use the default bucket specified by the Amazon SageMaker session.\n",
    "* The IAM role ARN used to give processing and training access to the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "import boto3\n",
    "\n",
    "sess = sagemaker.Session()\n",
    "role = sagemaker.get_execution_role()\n",
    "bucket = sess.default_bucket()\n",
    "region = boto3.Session().region_name\n",
    "\n",
    "sm = boto3.Session().client(service_name=\"sagemaker\", region_name=region)\n",
    "s3 = boto3.Session().client(service_name=\"s3\", region_name=region)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Setup Input Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%store -r s3_public_path_tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    s3_public_path_tsv\n",
    "except NameError:\n",
    "    print(\"++++++++++++++++++++++++++++++++++++++++++++++++++++++++\")\n",
    "    print(\"[ERROR] Please run the notebooks in the INGEST section before you continue.\")\n",
    "    print(\"++++++++++++++++++++++++++++++++++++++++++++++++++++++++\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print(s3_public_path_tsv)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%store -r s3_private_path_tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    s3_private_path_tsv\n",
    "except NameError:\n",
    "    print(\"++++++++++++++++++++++++++++++++++++++++++++++++++++++++\")\n",
    "    print(\"[ERROR] Please run the notebooks in the INGEST section before you continue.\")\n",
    "    print(\"++++++++++++++++++++++++++++++++++++++++++++++++++++++++\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print(s3_private_path_tsv)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "raw_input_data_s3_uri = \"s3://{}/amazon-reviews-pds/tsv/\".format(bucket)\n",
    "print(raw_input_data_s3_uri)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws s3 ls $raw_input_data_s3_uri"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Run the Processing Job using Amazon SageMaker\n",
    "\n",
    "Next, use the Amazon SageMaker Python SDK to submit a processing job using our custom python script."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Review the Processing Script"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!pygmentize preprocess-scikit-text-to-bert-feature-store.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run this script as a processing job.  You also need to specify one `ProcessingInput` with the `source` argument of the Amazon S3 bucket and `destination` is where the script reads this data from `/opt/ml/processing/input` (inside the Docker container.)  All local paths inside the processing container must begin with `/opt/ml/processing/`.\n",
    "\n",
    "Also give the `run()` method a `ProcessingOutput`, where the `source` is the path the script writes output data to.  For outputs, the `destination` defaults to an S3 bucket that the Amazon SageMaker Python SDK creates for you, following the format `s3://sagemaker-<region>-<account_id>/<processing_job_name>/output/<output_name>/`.  You also give the `ProcessingOutput` value for `output_name`, to make it easier to retrieve these output artifacts after the job is run.\n",
    "\n",
    "The arguments parameter in the `run()` method are command-line arguments in our `preprocess-scikit-text-to-bert-feature-store.py` script.\n",
    "\n",
    "Note that we sharding the data using `ShardedByS3Key` to spread the transformations across all worker nodes in the cluster."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Track the `Experiment`\n",
    "We will track every step of this experiment throughout the `prepare`, `train`, `optimize`, and `deploy`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Concepts\n",
    "\n",
    "**Experiment**: A collection of related Trials.  Add Trials to an Experiment that you wish to compare together.\n",
    "\n",
    "**Trial**: A description of a multi-step machine learning workflow. Each step in the workflow is described by a Trial Component. There is no relationship between Trial Components such as ordering.\n",
    "\n",
    "**Trial Component**: A description of a single step in a machine learning workflow. For example data cleaning, feature extraction, model training, model evaluation, etc.\n",
    "\n",
    "**Tracker**: A logger of information about a single TrialComponent.\n",
    "\n",
    "<img src=\"img/sagemaker-experiments.png\" width=\"90%\" align=\"left\">\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Create the `Experiment`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "from smexperiments.experiment import Experiment\n",
    "\n",
    "timestamp = int(time.time())\n",
    "\n",
    "experiment = Experiment.create(\n",
    "    experiment_name=\"Amazon-Customer-Reviews-BERT-Experiment-{}\".format(timestamp),\n",
    "    description=\"Amazon Customer Reviews BERT Experiment\",\n",
    "    sagemaker_boto_client=sm,\n",
    ")\n",
    "\n",
    "experiment_name = experiment.experiment_name\n",
    "print(\"Experiment name: {}\".format(experiment_name))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Create the `Trial`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "from smexperiments.trial import Trial\n",
    "\n",
    "timestamp = int(time.time())\n",
    "\n",
    "trial = Trial.create(\n",
    "    trial_name=\"trial-{}\".format(timestamp), experiment_name=experiment_name, sagemaker_boto_client=sm\n",
    ")\n",
    "\n",
    "trial_name = trial.trial_name\n",
    "print(\"Trial name: {}\".format(trial_name))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Create the `Experiment Config`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment_config = {\n",
    "    \"ExperimentName\": experiment_name,\n",
    "    \"TrialName\": trial_name,\n",
    "    \"TrialComponentDisplayName\": \"prepare\",\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(experiment_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%store experiment_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(trial_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%store trial_name"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Create Feature Store and Feature Group"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "featurestore_runtime = boto3.Session().client(service_name=\"sagemaker-featurestore-runtime\", region_name=region)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "timestamp = int(time.time())\n",
    "\n",
    "feature_store_offline_prefix = \"reviews-feature-store-\" + str(timestamp)\n",
    "\n",
    "print(feature_store_offline_prefix)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_group_name = \"reviews-feature-group-\" + str(timestamp)\n",
    "\n",
    "print(feature_group_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.feature_store.feature_definition import (\n",
    "    FeatureDefinition,\n",
    "    FeatureTypeEnum,\n",
    ")\n",
    "\n",
    "feature_definitions = [\n",
    "    FeatureDefinition(feature_name=\"input_ids\", feature_type=FeatureTypeEnum.STRING),\n",
    "    FeatureDefinition(feature_name=\"input_mask\", feature_type=FeatureTypeEnum.STRING),\n",
    "    FeatureDefinition(feature_name=\"segment_ids\", feature_type=FeatureTypeEnum.STRING),\n",
    "    FeatureDefinition(feature_name=\"label_id\", feature_type=FeatureTypeEnum.INTEGRAL),\n",
    "    FeatureDefinition(feature_name=\"review_id\", feature_type=FeatureTypeEnum.STRING),\n",
    "    FeatureDefinition(feature_name=\"date\", feature_type=FeatureTypeEnum.STRING),\n",
    "    FeatureDefinition(feature_name=\"label\", feature_type=FeatureTypeEnum.INTEGRAL),\n",
    "    #    FeatureDefinition(feature_name='review_body', feature_type=FeatureTypeEnum.STRING)\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.feature_store.feature_group import FeatureGroup\n",
    "\n",
    "feature_group = FeatureGroup(name=feature_group_name, feature_definitions=feature_definitions, sagemaker_session=sess)\n",
    "\n",
    "print(feature_group)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Set the Processing Job Hyper-Parameters "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "processing_instance_type = \"ml.c5.2xlarge\"\n",
    "processing_instance_count = 2\n",
    "train_split_percentage = 0.90\n",
    "validation_split_percentage = 0.05\n",
    "test_split_percentage = 0.05\n",
    "balance_dataset = True\n",
    "max_seq_length = 64"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Choosing a `max_seq_length` for BERT\n",
    "Since a smaller `max_seq_length` leads to faster training and lower resource utilization, we want to find the smallest review length that captures `80%` of our reviews.\n",
    "\n",
    "Remember our distribution of review lengths from a previous section?\n",
    "\n",
    "```\n",
    "mean         51.683405\n",
    "std         107.030844\n",
    "min           1.000000\n",
    "10%           2.000000\n",
    "20%           7.000000\n",
    "30%          19.000000\n",
    "40%          22.000000\n",
    "50%          26.000000\n",
    "60%          32.000000\n",
    "70%          43.000000\n",
    "80%          63.000000\n",
    "90%         110.000000\n",
    "100%       5347.000000\n",
    "max        5347.000000\n",
    "```\n",
    "\n",
    "![](img/review_word_count_distribution.png)\n",
    "\n",
    "Review length `63` represents the `80th` percentile for this dataset.  However, it's best to stick with powers-of-2 when using BERT.  So let's choose `64` as this is the smallest power-of-2 greater than `63`.  Reviews with length > `64` will be truncated to `64`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from sagemaker.sklearn.processing import SKLearnProcessor\n",
    "\n",
    "processor = SKLearnProcessor(\n",
    "    framework_version=\"0.23-1\",\n",
    "    role=role,\n",
    "    instance_type=processing_instance_type,\n",
    "    instance_count=processing_instance_count,\n",
    "    env={\"AWS_DEFAULT_REGION\": region},\n",
    "    max_runtime_in_seconds=7200,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.processing import ProcessingInput, ProcessingOutput\n",
    "\n",
    "processor.run(\n",
    "    code=\"preprocess-scikit-text-to-bert-feature-store.py\",\n",
    "    inputs=[\n",
    "        ProcessingInput(\n",
    "            input_name=\"raw-input-data\",\n",
    "            source=raw_input_data_s3_uri,\n",
    "            destination=\"/opt/ml/processing/input/data/\",\n",
    "            s3_data_distribution_type=\"ShardedByS3Key\",\n",
    "        )\n",
    "    ],\n",
    "    outputs=[\n",
    "        ProcessingOutput(\n",
    "            output_name=\"bert-train\", s3_upload_mode=\"EndOfJob\", source=\"/opt/ml/processing/output/bert/train\"\n",
    "        ),\n",
    "        ProcessingOutput(\n",
    "            output_name=\"bert-validation\",\n",
    "            s3_upload_mode=\"EndOfJob\",\n",
    "            source=\"/opt/ml/processing/output/bert/validation\",\n",
    "        ),\n",
    "        ProcessingOutput(\n",
    "            output_name=\"bert-test\", s3_upload_mode=\"EndOfJob\", source=\"/opt/ml/processing/output/bert/test\"\n",
    "        ),\n",
    "    ],\n",
    "    arguments=[\n",
    "        \"--train-split-percentage\",\n",
    "        str(train_split_percentage),\n",
    "        \"--validation-split-percentage\",\n",
    "        str(validation_split_percentage),\n",
    "        \"--test-split-percentage\",\n",
    "        str(test_split_percentage),\n",
    "        \"--max-seq-length\",\n",
    "        str(max_seq_length),\n",
    "        \"--balance-dataset\",\n",
    "        str(balance_dataset),\n",
    "        \"--feature-store-offline-prefix\",\n",
    "        str(feature_store_offline_prefix),\n",
    "        \"--feature-group-name\",\n",
    "        str(feature_group_name),\n",
    "    ],\n",
    "    experiment_config=experiment_config,\n",
    "    logs=True,\n",
    "    wait=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "scikit_processing_job_name = processor.jobs[-1].describe()[\"ProcessingJobName\"]\n",
    "print(scikit_processing_job_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from IPython.core.display import display, HTML\n",
    "\n",
    "display(\n",
    "    HTML(\n",
    "        '<b>Review <a target=\"blank\" href=\"https://console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs/{}\">Processing Job</a></b>'.format(\n",
    "            region, scikit_processing_job_name\n",
    "        )\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from IPython.core.display import display, HTML\n",
    "\n",
    "display(\n",
    "    HTML(\n",
    "        '<b>Review <a target=\"blank\" href=\"https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix\">CloudWatch Logs</a> After About 5 Minutes</b>'.format(\n",
    "            region, scikit_processing_job_name\n",
    "        )\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from IPython.core.display import display, HTML\n",
    "\n",
    "display(\n",
    "    HTML(\n",
    "        '<b>Review <a target=\"blank\" href=\"https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview\">S3 Output Data</a> After The Processing Job Has Completed</b>'.format(\n",
    "            bucket, scikit_processing_job_name, region\n",
    "        )\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Monitor the Processing Job"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "running_processor = sagemaker.processing.ProcessingJob.from_processing_name(\n",
    "    processing_job_name=scikit_processing_job_name, sagemaker_session=sess\n",
    ")\n",
    "\n",
    "processing_job_description = running_processor.describe()\n",
    "\n",
    "print(processing_job_description)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "running_processor.wait(logs=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# _Please Wait Until the ^^ Processing Job ^^ Completes Above._"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Inspect the Processed Output Data\n",
    "\n",
    "Take a look at a few rows of the transformed dataset to make sure the processing was successful."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "processing_job_description = running_processor.describe()\n",
    "\n",
    "output_config = processing_job_description[\"ProcessingOutputConfig\"]\n",
    "for output in output_config[\"Outputs\"]:\n",
    "    if output[\"OutputName\"] == \"bert-train\":\n",
    "        processed_train_data_s3_uri = output[\"S3Output\"][\"S3Uri\"]\n",
    "    if output[\"OutputName\"] == \"bert-validation\":\n",
    "        processed_validation_data_s3_uri = output[\"S3Output\"][\"S3Uri\"]\n",
    "    if output[\"OutputName\"] == \"bert-test\":\n",
    "        processed_test_data_s3_uri = output[\"S3Output\"][\"S3Uri\"]\n",
    "\n",
    "print(processed_train_data_s3_uri)\n",
    "print(processed_validation_data_s3_uri)\n",
    "print(processed_test_data_s3_uri)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws s3 ls $processed_train_data_s3_uri/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws s3 ls $processed_validation_data_s3_uri/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws s3 ls $processed_test_data_s3_uri/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pass Variables to the Next Notebook(s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%store raw_input_data_s3_uri"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%store max_seq_length"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%store train_split_percentage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%store validation_split_percentage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%store test_split_percentage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%store balance_dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%store feature_store_offline_prefix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%store feature_group_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%store processed_train_data_s3_uri"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%store processed_validation_data_s3_uri"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%store processed_test_data_s3_uri"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%store"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Query The Feature Store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# feature_store_query = feature_group.athena_query()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# feature_store_table = feature_store_query.table_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# query_string = \"\"\"\n",
    "# SELECT input_ids, input_mask, segment_ids, label_id, split_type  FROM \"{}\" WHERE split_type='train' LIMIT 5\n",
    "# \"\"\".format(\n",
    "#     feature_store_table\n",
    "# )\n",
    "\n",
    "# print(\"Running \" + query_string)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# feature_store_query.run(\n",
    "#     query_string=query_string,\n",
    "#     output_location=\"s3://\" + bucket + \"/\" + feature_store_offline_prefix + \"/query_results/\",\n",
    "# )\n",
    "\n",
    "# feature_store_query.wait()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# feature_store_query.as_dataframe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Show the Experiment Tracking Lineage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.analytics import ExperimentAnalytics\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "pd.set_option(\"max_colwidth\", 500)\n",
    "# pd.set_option(\"max_rows\", 100)\n",
    "\n",
    "experiment_analytics = ExperimentAnalytics(\n",
    "    sagemaker_session=sess, experiment_name=experiment_name, sort_by=\"CreationTime\", sort_order=\"Descending\"\n",
    ")\n",
    "\n",
    "experiment_analytics_df = experiment_analytics.dataframe()\n",
    "experiment_analytics_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trial_component_name = experiment_analytics_df.TrialComponentName[0]\n",
    "print(trial_component_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trial_component_description = sm.describe_trial_component(TrialComponentName=trial_component_name)\n",
    "trial_component_description"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Show SageMaker ML Lineage Tracking \n",
    "\n",
    "Amazon SageMaker ML Lineage Tracking creates and stores information about the steps of a machine learning (ML) workflow from data preparation to model deployment. \n",
    "\n",
    "Amazon SageMaker Lineage enables events that happen within SageMaker to be traced via a graph structure. The data simplifies generating reports, making comparisons, or discovering relationships between events. For example easily trace both how a model was generated and where the model was deployed.\n",
    "\n",
    "The lineage graph is created automatically by SageMaker and you can directly create or modify your own graphs.\n",
    "\n",
    "## Key Concepts\n",
    "\n",
    "* **Lineage Graph** - A connected graph tracing your machine learning workflow end to end.\n",
    "\n",
    "* **Artifacts** - Represents a URI addressable object or data. Artifacts are typically inputs or outputs to Actions.\n",
    "\n",
    "* **Actions** - Represents an action taken such as a computation, transformation, or job.\n",
    "\n",
    "* **Contexts** - Provides a method to logically group other entities.\n",
    "\n",
    "* **Associations** - A directed edge in the lineage graph that links two entities.\n",
    "\n",
    "* **Lineage Traversal** - Starting from an arbitrary point trace the lineage graph to discover and analyze relationships between steps in your workflow.\n",
    "\n",
    "* **Experiments** - Experiment entites (Experiments, Trials, and Trial Components) are also part of the lineage graph and can be associated wtih Artifacts, Actions, or Contexts."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Show Lineage Artifacts For Our Processing Job"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.lineage.visualizer import LineageTableVisualizer\n",
    "\n",
    "lineage_table_viz = LineageTableVisualizer(sess)\n",
    "lineage_table_viz_df = lineage_table_viz.show(processing_job_name=scikit_processing_job_name)\n",
    "lineage_table_viz_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Release Resources"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%html\n",
    "\n",
    "<p><b>Shutting down your kernel for this notebook to release resources.</b></p>\n",
    "<button class=\"sm-command-button\" data-commandlinker-command=\"kernelmenu:shutdown\" style=\"display:none;\">Shutdown Kernel</button>\n",
    "        \n",
    "<script>\n",
    "try {\n",
    "    els = document.getElementsByClassName(\"sm-command-button\");\n",
    "    els[0].click();\n",
    "}\n",
    "catch(err) {\n",
    "    // NoOp\n",
    "}    \n",
    "</script>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%%javascript\n",
    "\n",
    "try {\n",
    "    Jupyter.notebook.save_checkpoint();\n",
    "    Jupyter.notebook.session.delete();\n",
    "}\n",
    "catch(err) {\n",
    "    // NoOp\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "instance_type": "ml.t3.medium",
  "kernelspec": {
   "display_name": "Python 3 (Data Science)",
   "language": "python",
   "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
